HuggingFaceDataset
Bases: Dataset
Streaming dataset backed by a HuggingFace datasets source.
Each row produced by datasets.load_dataset is rendered to JSON through
the Jinja2 input_template / output_template, validated against the
corresponding DataModel (or synalinks.ChatMessages when no data model
is given), and accumulated into batches of size batch_size.
Each batch is yielded as a pair (x, y) of NumPy object arrays of
DataModel instances, matching the format that synalinks'
GeneratorDataAdapter expects.
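The batching step described above can be sketched in plain Python. This is a minimal illustration, not the synalinks implementation: `batch_pairs` is a hypothetical helper, plain dicts stand in for DataModel instances, lists stand in for NumPy object arrays, and emitting a trailing partial batch is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]


def batch_pairs(
    rows: Iterable[Tuple[Row, Row]], batch_size: int
) -> Iterator[Tuple[List[Row], List[Row]]]:
    """Group (input, output) pairs into (x, y) batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    xs: List[Row] = []
    ys: List[Row] = []
    for x, y in rows:
        xs.append(x)
        ys.append(y)
        if len(xs) == batch_size:
            yield xs, ys
            xs, ys = [], []
    # Assumption: a trailing partial batch is emitted rather than dropped.
    if xs:
        yield xs, ys


# Five hypothetical rows, already rendered and validated.
pairs = [({"question": f"q{i}"}, {"answer": str(i)}) for i in range(5)]
batches = list(batch_pairs(pairs, batch_size=2))
print([len(x) for x, _ in batches])  # prints [2, 2, 1]
```

Each yielded tuple keeps inputs and outputs aligned by position, which is the property a fit loop relies on.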
Templates should render to JSON. Use Jinja's tojson filter for
safe string escaping.
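To see why escaping matters, the sketch below compares naive string interpolation with JSON-escaped interpolation. It uses the standard library's `json.dumps` as a stand-in for Jinja's tojson filter so that it runs without Jinja2 installed; the sample `question` string is made up for illustration.

```python
import json

# A hypothetical field value containing quotes and a newline.
question = 'She said "hi" and left.\nThen what?'

# Naive interpolation: the embedded quotes and raw newline break the JSON.
naive = '{"question": "' + question + '"}'
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Escaped interpolation: json.dumps adds the surrounding quotes and
# escapes special characters, so no quotes are written around the value.
safe = '{"question": ' + json.dumps(question) + "}"
parsed = json.loads(safe)

print(naive_ok)  # prints False
print(parsed["question"] == question)  # prints True
```

This is also why a template placeholder that uses tojson is written without surrounding quotes: the filter emits them itself.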
Example:
    ds = synalinks.HuggingFaceDataset(
        path="gsm8k",
        name="main",
        split="train",
        input_data_model=MathQuestion,
        input_template='{"question": {{ question | tojson }}}',
        output_data_model=NumericalAnswer,
        output_template='{"answer": {{ answer.split("####")[-1].strip() | tojson }}}',
        batch_size=8,
    )
    program.fit(x=ds())
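
The expression inside output_template is ordinary Python string handling, so it can be checked outside Jinja. The sketch below applies the same expression to a made-up answer string that ends with a `####` marker followed by the final value; the sample text is illustrative, not taken from the dataset.

```python
# Hypothetical answer text: worked reasoning, then "####" and the result.
raw_answer = "She has 3 + 4 = 7 apples.\n#### 7"

# Same expression as in the output_template above.
final_answer = raw_answer.split("####")[-1].strip()
print(final_answer)  # prints 7

# Without a "####" marker, split returns the whole string as its only
# element, so the expression falls back to the stripped full text.
no_marker = "  just 12  "
print(no_marker.split("####")[-1].strip())  # prints just 12
```

The fallback behavior means a row without the marker still renders, so malformed rows are not filtered out by the template itself.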
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The HuggingFace dataset repo / builder name (first positional argument of | required |
| `name` | `str` | Optional. The dataset configuration name. | `None` |
| `split` | `str` | Optional. The split to load (e.g. | `None` |
| `revision` | `str` | Optional. The dataset revision (commit hash, branch, tag). | `None` |
| `streaming` | `bool` | If | `True` |
| `input_data_model` | `DataModel` | See | `None` |
| `input_schema` | `dict \| str` | See | `None` |
| `input_template` | `str` | See | `None` |
| `output_data_model` | `DataModel` | See | `None` |
| `output_schema` | `dict \| str` | See | `None` |
| `output_template` | `str` | See | `None` |
| `batch_size` | `int` | Examples per yielded batch. Defaults to | `1` |
| `limit` | `int` | Optional. See | `None` |
| `repeat` | `int` | See | `1` |
| `**kwargs` | `Any` | Forwarded to | `{}` |
Source code in synalinks/src/datasets/huggingface_dataset.py
load_split(path, *, name=None, split, input_data_model, input_template, output_data_model=None, output_template=None, limit=None, **load_kwargs)
Materialize a single HF split into one (x, y) (or (x,)) pair.
A thin convenience wrapper around
HuggingFaceDataset(streaming=False).materialize() that takes
the same arguments as the HuggingFaceDataset constructor and
returns numpy object arrays directly.
Use this when you want a whole HF split as in-memory NumPy
arrays: for evaluation, for head/tail train/test splits via
split_train_test, or for quick experiments. For streaming use
cases, construct HuggingFaceDataset directly.
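As a rough illustration of what a head/tail split means once a split is fully in memory, the sketch below cuts a sequence into a leading training part and a trailing test part without shuffling. `head_tail_split` is a hypothetical helper written for this example; it is not synalinks' split_train_test, whose signature and behavior are not shown here.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def head_tail_split(
    items: Sequence[T], test_fraction: float = 0.2
) -> Tuple[List[T], List[T]]:
    """Return (head, tail): the tail holds the last test_fraction of items."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = int(len(items) * test_fraction)
    cut = len(items) - n_test
    return list(items[:cut]), list(items[cut:])


# Stand-in for a materialized x array of ten examples.
x = [f"q{i}" for i in range(10)]
x_train, x_test = head_tail_split(x, test_fraction=0.2)
print(len(x_train), len(x_test))  # prints 8 2
print(x_test)  # prints ['q8', 'q9']
```

Applying the same cut index to x and y keeps inputs and targets aligned, which is only possible when the whole split has been materialized rather than streamed.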